22 research outputs found

    Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue

    Get PDF
    Evaluation is a critical element in the development of many natural-language-based systems. In this thesis, we present critical analyses of the standard evaluation methodologies applied in the following Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks such as MRC are usually evaluated by comparing the similarity between hand-crafted references and system-generated outputs using automatic evaluation metrics; these metrics are mainly borrowed from other well-developed NLP tasks, such as machine translation and text summarization. The evaluation of QG and dialogue, meanwhile, is a known open problem, as these tasks have no corresponding references against which similarity can be computed, and human evaluation is indispensable when assessing system performance. However, human evaluation is unfortunately not always valid because: i) it may cost too much and be hard to deploy when experts are involved; and ii) human assessors can lack reliability in a crowd-sourcing environment. To overcome the challenges of both automatic metrics and human evaluation, we first design task-specific crowd-sourced human evaluation methods for each of the three target tasks. We then show that these human evaluation methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, with the data collected from our experiments, we measure the accuracy of existing automatic metrics and analyse the potential limitations and disadvantages of applying these metrics directly. Furthermore, drawing on the specific features of the different tasks, we provide detailed statistical analyses of the collected data to discover their underlying trends, and give suggestions about directions for improving systems along different aspects.

    QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation

    Full text link
    Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has brought substantial improvements in the quality of automatically generated questions, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluation have been criticized for their low agreement with human judgement. We therefore propose QAScore, a new reference-free evaluation metric with the potential to provide a better mechanism for evaluating QG systems. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross-entropy of the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourced human evaluation experiment for QG evaluation to investigate how QAScore and other metrics correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method than existing word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore. Comment: 19 pages, 5 figures, 7 tables
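    The masked-word scoring idea behind QAScore can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `lm_prob` is a hypothetical stand-in for a pretrained masked language model, and `qascore_like` simply averages per-token log-probabilities, so that a question which makes the answer words easier to reconstruct receives a higher score.

    ```python
    import math

    def qascore_like(token_probs):
        # QAScore-style value: mean log-probability the language model assigns
        # to each masked answer word given the passage and candidate question.
        # Higher (closer to 0) means the question better pins down the answer.
        assert token_probs and all(0.0 < p <= 1.0 for p in token_probs)
        return sum(math.log(p) for p in token_probs) / len(token_probs)

    def score_question(lm_prob, passage, question, answer_words):
        # lm_prob(masked_text, target_word) -> probability is a hypothetical
        # stand-in for a pretrained masked LM; the real metric would query
        # such a model once per masked answer word.
        probs = []
        for i, word in enumerate(answer_words):
            masked = answer_words[:i] + ["[MASK]"] + answer_words[i + 1:]
            text = f"{passage} {question} {' '.join(masked)}"
            probs.append(lm_prob(text, word))
        return qascore_like(probs)
    ```

    Because no model is fine-tuned, the metric is unsupervised: any off-the-shelf masked LM can be plugged in for `lm_prob`.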

    Document-Level Machine Translation with Large Language Models

    Full text link
    Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. Taking document-level machine translation (MT) as a testbed, this paper provides an in-depth evaluation of LLMs' ability at discourse modeling. The study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where we investigate the impact of different prompts on document-level translation quality and discourse phenomena; 2) Comparison of Translation Models, where we compare the translation performance of ChatGPT with commercial MT systems and advanced document-level MT methods; and 3) Analysis of Discourse Modelling Abilities, where we further probe the discourse knowledge encoded in LLMs and examine the impact of training techniques on discourse modeling. By evaluating on a number of benchmarks, we surprisingly find that 1) leveraging their powerful long-text modeling capabilities, ChatGPT outperforms commercial MT systems in terms of human evaluation; 2) GPT-4 demonstrates a strong ability to explain discourse knowledge, even though it may select incorrect translation candidates in contrastive testing; and 3) ChatGPT and GPT-4 demonstrate superior performance and show potential to become a new and promising paradigm for document-level translation. This work highlights the challenges and opportunities of discourse modeling for LLMs, which we hope can inspire the future design and evaluation of LLMs.

    Semantic-aware dynamic retrospective-prospective reasoning for event-level video question answering

    Get PDF
    Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed to provide optimal answers. However, despite significant progress in model performance, few studies have focused on using the explicit semantic connections between the question and visual information, especially at the event level. There is a need to use such semantic connections to facilitate complex reasoning across video frames. Therefore, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. Specifically, we explicitly use the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, where we decide to move to the next frame based on which part of the SRL structure (agent, verb, patient, etc.) of the question is being focused on. We conduct experiments on a benchmark EVQA dataset, TrafficQA. Results show that our proposed approach achieves superior performance compared to previous state-of-the-art models. Our code is publicly available at https://github.com/lyuchenyang/Semantic-aware-VideoQA
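    The retrospective-prospective frame-selection loop can be illustrated with a toy sketch. Everything here is an assumption for illustration: `frame_scores` stands in for per-frame relevance scores that a vision-language model would produce for each SRL role, and the walk simply moves backward (retrospective) or forward (prospective) toward the most relevant frame for the role currently in focus.

    ```python
    def retrospective_prospective_walk(srl_roles, frame_scores, start=0):
        # srl_roles: ordered roles from the question's SRL parse,
        #            e.g. ["agent", "verb", "patient"] (illustrative).
        # frame_scores: role -> per-frame relevance scores (hypothetical,
        #               normally produced by a vision-language model).
        # Returns the (role, frame) attention path through the video.
        path, frame = [], start
        n_frames = len(next(iter(frame_scores.values())))
        for role in srl_roles:
            scores = frame_scores[role]
            target = max(range(n_frames), key=scores.__getitem__)
            step = 1 if target > frame else -1 if target < frame else 0
            while frame != target:      # one frame at a time, either direction
                frame += step
            path.append((role, frame))
        return path
    ```

    The point of the sketch is the control flow: the SRL role in focus, not a fixed schedule, determines whether the model looks backward or forward in the video.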

    Is a video worth n × n Images? A highly efficient approach to transformer-based video question answering

    Get PDF
    Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question. However, such a schema incurs significant memory use and inevitably slows down training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we concatenate video frames into an n × n matrix and then convert it to one image. By doing so, we reduce the use of the image encoder from n^2 invocations to 1 while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly 4× faster speed and only 30% memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed-up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those who have limited access to budgets and resources. Our code is publicly available at https://github.com/lyuchenyang/Efficient-VideoQA for research use
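    The frame-concatenation step can be sketched in a few lines. This is a minimal illustration, not the paper's code: frames are plain H × W nested lists (a real pipeline would use tensors with channels), tiled left-to-right, top-to-bottom so the grid position preserves temporal order.

    ```python
    def tile_frames(frames, n):
        # Concatenate n*n frames (each an H x W grid) into one (n*H) x (n*W)
        # image. Temporal order maps to reading order: frame t lands at grid
        # row t // n, column t % n, so a single image encoder pass sees all
        # frames while their relative order stays recoverable.
        assert len(frames) == n * n, "expects exactly n*n frames"
        h, w = len(frames[0]), len(frames[0][0])
        big = [[0] * (n * w) for _ in range(n * h)]
        for idx, frame in enumerate(frames):
            r, c = divmod(idx, n)           # grid cell for this frame
            for y in range(h):
                for x in range(w):
                    big[r * h + y][c * w + x] = frame[y][x]
        return big
    ```

    For n = 2, four frames become one quadrant-tiled image, which is why the encoder cost drops from n^2 forward passes to a single one.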

    Azimuthal asymmetries in lepton-pair production at a fixed-target experiment using the LHC beams (AFTER)

    Full text link
    A multi-purpose fixed-target experiment using the proton and lead-ion beams of the LHC was recently proposed by Brodsky, Fleuret, Hadjidakis and Lansberg; here we concentrate our study on issues related to the spin-physics part of this project (referred to as AFTER). We study the nucleon spin structure through $pp$ and $pd$ processes with a fixed-target experiment using the LHC proton beams, in the kinematical region where the 7 TeV proton beams yield a nucleon-nucleon center-of-mass energy of $\sqrt{s}=115$ GeV. We calculate and estimate the $\cos 2\phi$ azimuthal asymmetries of unpolarized $pp$ and $pd$ dilepton production processes in the Drell--Yan continuum region and at the $Z$-pole. We also calculate the $\sin(2\phi-\phi_S)$, $\sin(2\phi+\phi_S)$ and $\sin 2\phi$ azimuthal asymmetries of $pp$ and $pd$ dilepton production processes with the target proton and deuteron longitudinally or transversally polarized, in the Drell--Yan continuum region and around the $Z$-resonance region. We conclude that it is feasible to measure these azimuthal asymmetries, and consequently the three-dimensional or transverse-momentum-dependent parton distribution functions (3dPDFs or TMDs), at this new AFTER facility. Comment: 15 pages, 40 figures. Version accepted for publication in EPJ
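    For context, the $\cos 2\phi$ asymmetry studied here is conventionally defined through the standard angular decomposition of the unpolarized Drell--Yan cross section; this is the textbook form, not a formula specific to this paper:

    ```latex
    % Unpolarized Drell--Yan lepton angular distribution in a dilepton rest frame
    \frac{d\sigma}{d\Omega} \;\propto\;
      1 + \lambda\cos^{2}\theta
        + \mu\,\sin 2\theta\,\cos\phi
        + \frac{\nu}{2}\,\sin^{2}\theta\,\cos 2\phi
    ```

    The coefficient $\nu$ carries the $\cos 2\phi$ modulation; in the TMD framework it is commonly attributed to the Boer--Mulders functions, which is what makes this asymmetry a probe of transverse-momentum-dependent parton distributions.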

    B_c meson rare decays in the light-cone quark model

    Full text link
    We investigate the rare decays $B_c \rightarrow D_s(1968)\,\ell\bar{\ell}$ and $B_c \rightarrow D_s^*(2317)\,\ell\bar{\ell}$ in the framework of the light-cone quark model (LCQM). The transition form factors are calculated in the space-like region and then analytically continued to the time-like region via an exponential parametrization. The branching ratios and longitudinal lepton polarization asymmetries (LPAs) for the two decays are given and compared with each other. The results are helpful for investigating the structure of the $B_c$ meson and for testing the unitarity of the CKM quark mixing matrix. All these results can be tested in future experiments at the LHC. Comment: 9 pages, 11 figures, version accepted for publication in EPJ
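    The exponential parametrization mentioned here typically takes a form such as the following; this is a standard choice in light-cone quark model analyses, and the coefficients $a_1, a_2$ are fit parameters, not values taken from this paper:

    ```latex
    % Form factor fitted in the space-like region (q^2 < 0), then
    % analytically continued to the physical time-like region (q^2 > 0)
    F(q^{2}) \;=\; F(0)\,
      \exp\!\left( a_{1}\,q^{2} + a_{2}\,q^{4} \right)
    ```

    Fitting such a smooth analytic form to the space-like LCQM results is what allows the continuation to the time-like momentum transfers relevant for the physical decay.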

    Is a Video worth n×nn\times n Images? A Highly Efficient Approach to Transformer-based Video Question Answering

    Full text link
    Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question. However, such a schema incurs significant memory use and inevitably slows down training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we concatenate video frames into an n×nn\times n matrix and then convert it to one image. By doing so, we reduce the use of the image encoder from n2n^{2} invocations to 11 while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly 4×4\times faster speed and only 30% memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed-up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those who have limited access to budgets and resources. Our code will be made publicly available for research use

    Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering

    Full text link
    Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed to provide optimal answers. However, despite significant progress in model performance, few studies have focused on using the explicit semantic connections between the question and visual information, especially at the event level. There is a need to use such semantic connections to facilitate complex reasoning across video frames. Therefore, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. Specifically, we explicitly use the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, where we decide to move to the next frame based on which part of the SRL structure (agent, verb, patient, etc.) of the question is being focused on. We conduct experiments on a benchmark EVQA dataset, TrafficQA. Results show that our proposed approach achieves superior performance compared to previous state-of-the-art models. Our code will be made publicly available for research use